fix(oauth): narrow context.lock scope in async_auth_flow by peisuke · Pull Request #2660 · modelcontextprotocol/python-sdk

peisuke · 2026-05-22T07:42:17Z

Narrows the lock scope inside OAuthClientProvider.async_auth_flow so the actual HTTP request yield runs outside any lock. Concurrent requests through the same provider — in particular, a POST issued while a GET SSE long-poll is in flight — no longer serialize on context.lock. A new refresh_lock preserves single-flight semantics for token refresh so the change does not introduce a refresh race.

Motivation and Context

OAuthClientProvider.async_auth_flow previously held self.context.lock for the entire flow body, including the response = yield request that hands the request off to httpx. For requests that complete quickly this is fine, but a Streamable HTTP transport opens a long-lived GET SSE stream for server-initiated messages alongside the POST stream for client-initiated RPCs. Both go through the same OAuthClientProvider instance, so the GET SSE long-poll's yield request holds context.lock for the entire SSE lifetime. Any concurrent tools/call POST then blocks on the lock until the SSE returns, producing a multi-second stall in OAuth-authenticated MCP connections.

This commit splits the single coarse lock into purpose-specific scopes:

Phase 1 (context.lock): initialize state, capture protocol_version, and decide whether a refresh is needed. Short-held; no HTTP I/O.
Phase 2 (refresh_lock, new): single-flight token refresh. The refresh yield happens outside any lock. A double-check inside context.lock ensures concurrent waiters do not redundantly refresh after another coroutine completed one.
Phase 3 (no lock): add the auth header and yield the actual request. GET SSE long-polls and other long-running requests no longer block unrelated traffic.
Phase 4 (context.lock): 401 / 403 full OAuth re-auth path. Conservatively kept under lock because this path is rare and its yielded sub-requests target the AS, not the resource server. A future change can narrow this further with the same pattern.

refresh_lock is added to OAuthContext (alongside the existing lock) so concurrent expirations trigger at most one refresh request. The double-check pattern inside refresh_lock makes the second waiter observe the freshly-updated token and skip its own refresh.

How Has This Been Tested?

Five new tests in tests/client/test_auth.py exercise the refactor:

test_concurrent_request_not_blocked_by_pending_long_running_request — the core regression test. Drives one async_auth_flow generator to its yield (simulating a GET SSE long-poll), then opens a second flow on the same provider and asserts it reaches its own yield within anyio.fail_after(5). Pre-fix this hangs indefinitely.
test_refresh_lock_double_check_skips_redundant_refresh — a flow started after another finished its refresh observes the fresh token in Phase 1 and skips Phase 2 entirely; no second refresh request is sent.
test_double_check_inside_refresh_lock_skips_second_refresh — monkeypatches is_token_valid to return False in Phase 1 and True in the Phase 2 double-check, exercising the branch where a second coroutine's refresh is correctly elided inside refresh_lock.
test_refresh_with_failed_status_clears_tokens — covers the failed-refresh path (status_code != 200) including the _initialized = False reset.
test_refresh_with_invalid_json_clears_tokens — covers the ValidationError branch when the refresh response body is malformed.

./scripts/test is clean: 1177 passed / 98 skipped / 1 xfailed, 100.00% coverage on src/mcp/client/auth/oauth2.py, strict-no-cover reports no leftover # pragma: no cover violations. Four previously-needed pragmas in _handle_refresh_response and the Phase 2 yield are now covered by the new tests and removed.

Breaking Changes

None. Public API is unchanged. Concurrency semantics narrow only — previously-serialized requests now proceed in parallel; token refresh remains serialized via refresh_lock.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update

Checklist

I have read the MCP Documentation
My code follows the repository's style guidelines
New and existing tests pass locally
I have added appropriate error handling
I have added or updated documentation as needed

Additional context

Targets main (v2 development). The same bug exists on v1.x; if a backport is wanted, that can land as a follow-up PR.

Fixes #1326

…poll requests Previously the entire `OAuthClientProvider.async_auth_flow` body ran under `self.context.lock`, including the `yield request` that hands the request off to httpx. For requests that complete quickly this is fine, but a GET SSE long-poll holds the lock for the full SSE lifetime — which means any concurrent POST (e.g. `tools/call`) is blocked waiting for the lock, producing a ~16s perceived stall on lazy MCP connections that use OAuth. This commit splits the single coarse lock into purpose-specific scopes: Phase 1 (context.lock): initialize state, capture protocol_version, and decide whether a refresh is needed. Short-held; no HTTP I/O. Phase 2 (refresh_lock, new): single-flight token refresh. The refresh request `yield` happens outside any lock. A double-check inside `context.lock` ensures concurrent waiters do not redundantly refresh after another coroutine completed one. Phase 3 (no lock): add the auth header and yield the actual request. GET SSE long-polls and other long-running requests no longer block unrelated traffic. Phase 4 (context.lock): 401 / 403 full OAuth re-auth path. Conservatively kept under lock because this path is rare and its yielded sub-requests (metadata discovery, registration, token exchange) hit the AS, not the resource server. A future refactor can narrow this further. Lock additions: - `OAuthContext.refresh_lock: anyio.Lock` provides single-flight refresh so concurrent requests trigger at most one token refresh. Behavior changes: - Concurrent requests through the same `OAuthClientProvider` no longer serialize at the lock. GET SSE long-polls and POSTs now proceed in parallel. - Token refresh remains serialized (via `refresh_lock`), preserving the invariant that only one refresh request is in flight at a time. - Public API and behavior are otherwise unchanged. Related upstream issue: modelcontextprotocol#1326

…ests Two tests in a new TestConcurrentRequestsDoNotDeadlock class exercise the behavior the previous commit fixes: 1. ``test_concurrent_request_not_blocked_by_pending_long_running_request`` drives one async_auth_flow generator to its yield (= simulating a GET SSE long-poll suspended waiting for the next event) and then opens a second concurrent flow on the same provider. The second flow must reach its own yield within a short timeout — i.e., the lock release between Phase 1 and Phase 3 lets it through. Pre-fix, the second generator would block on context.lock indefinitely. 2. ``test_concurrent_token_refresh_is_single_flight`` exercises the refresh_lock single-flight path. A first flow performs the refresh yield; a second flow started after the refresh completes observes the freshly-updated token in Phase 1 and proceeds directly to its own request yield without issuing a second refresh. Also: tighten the refresh_request unbound-after-conditional-write pattern in async_auth_flow so pyright recognizes it as definitely assigned at the yield site (was: derived from a boolean predicate; now: typed as ``httpx.Request | None`` and checked explicitly).

After dropping the function-level `# pragma: no cover` from `_handle_refresh_response` and removing the per-line pragmas from the refactored Phase 2, the strict-no-cover audit identified covered lines still marked pragma'd and surfaced previously-untested branches. Three new tests close the coverage gaps: * `test_refresh_with_failed_status_clears_tokens` — exercises the ``response.status_code != 200`` branch in `_handle_refresh_response` and the `self._initialized = False` reset on refresh failure. * `test_refresh_with_invalid_json_clears_tokens` — exercises the ValidationError branch when the refresh body is not valid JSON. * `test_double_check_inside_refresh_lock_skips_second_refresh` — uses monkeypatch to flip `is_token_valid` between Phase 1 (False) and the double-check inside `refresh_lock` (True), exercising the branch where a second coroutine's refresh is correctly elided. Also: convert the new tests from the legacy Test* class pattern to plain top-level `test_*` functions per AGENTS.md, and drop unneeded per-line `# pragma: no cover` markers in the refactored auth_flow. Coverage report: 100.00% on `src/mcp/client/auth/oauth2.py`, strict-no-cover clean.

On Python 3.10/3.11, coverage.py (sys.settrace backend) misattributes the False arm of compound boolean predicates inside an async with block in an async generator, leading to spurious partial-branch warnings on 4 lines in async_auth_flow: - line 536: `if not is_token_valid() and can_refresh_token():` (Phase 1) - line 546: same (Phase 2 double-check inside refresh_lock) - line 548: `if refresh_request is not None:` (re-check skip path) - line 553: `if not await _handle_refresh_response(...):` (refresh success path) Python 3.12+ (sys.monitoring) tracks these branches correctly. Add `# pragma: no branch` to the 4 lines as a workaround for the legacy backend only. An inline comment block above the Phase 1 if documents the rationale. Also add 2 unit tests that explicitly exercise the affected branches so the coverage drop is shown to be a tool-precision issue, not missing tests: - test_phase1_skips_refresh_when_token_valid: Phase 1 sees is_token_valid=True and skips Phase 2 entirely. - test_refresh_success_proceeds_to_phase3_without_resetting_initialized: _handle_refresh_response returns True (HTTP 200) so the _initialized reset is skipped and Phase 3 proceeds with the fresh token. Local verification (Python 3.10): pytest tests/client/test_auth.py -n auto + coverage report → 99.43% (was 98.30%, BrPart 1, Missing only line 206 which is outside the test_auth.py scope). Full suite should reach 100%.

Bartok9 · 2026-06-13T06:07:08Z

Re-confirmation + offer of a conflict-free rebase.

This fix is needed: #2847 just landed (June 12) as a second, independent production repro of the exact RuntimeError: The current task is not holding this lock this PR addresses — a gateway connecting to multiple OAuth MCP servers concurrently, where anyio.Lock (which records task identity at acquire() and enforces same-task release()) is held across the yield in async_auth_flow, so httpx driving the generator from a different task throws on release. The reporter explicitly asks for this PR to be reviewed/merged.

The PR currently shows CONFLICTING against main. The conflict is small and isolated to the Phase-1 refresh block (main still has the inline refresh under the held lock; this PR restructures it into Phase-1 decision / Phase-2 single-flight refresh outside the lock). I rebased the four commits onto current main (616476f-era) — the resolution is purely "take this PR's restructured refresh path" — and the result is green:

uv run pytest tests/client/test_auth.py -q
108 passed, 1 xfailed (0.29s)
uv run ruff check src/mcp/client/auth/oauth2.py tests/client/test_auth.py
All checks passed!

Conflict-free rebase (author/commits preserved): https://github.com/Bartok9/python-sdk/tree/salvage/2660-oauth-lock-rebase — happy to open it as a fresh PR if @peisuke is unavailable to push, or @peisuke can cherry-pick. The single-flight refresh_lock + re-check-under-lock pattern is the right shape; narrowing the lock so no HTTP yield (especially the long-poll GET SSE) runs while holding context.lock is exactly what unblocks the concurrent case.

peisuke added 3 commits May 21, 2026 01:33

peisuke marked this pull request as draft May 22, 2026 07:46

peisuke force-pushed the pr/oauth-lock-scope branch from 4f5e8e3 to 97b4bfa Compare May 27, 2026 13:47

peisuke force-pushed the pr/oauth-lock-scope branch from 97b4bfa to a09f68e Compare May 27, 2026 14:20

EnGassa mentioned this pull request Jun 12, 2026

Bug: anyio.Lock in async_auth_flow causes RuntimeError under concurrent OAuth MCP connections #2847

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(oauth): narrow context.lock scope in async_auth_flow#2660

fix(oauth): narrow context.lock scope in async_auth_flow#2660
peisuke wants to merge 4 commits into
modelcontextprotocol:mainfrom
peisuke:pr/oauth-lock-scope

peisuke commented May 22, 2026

Uh oh!

Bartok9 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

peisuke commented May 22, 2026

Motivation and Context

How Has This Been Tested?

Breaking Changes

Types of changes

Checklist

Additional context

Uh oh!

Bartok9 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants